Computer program product for determination of remote adapter and/or node liveness

ABSTRACT

The determination of node and/or adapter liveness in a distributed network data processing system is carried out via one messaging protocol that can be assisted by a second messaging protocol which is significantly less susceptible to delay, especially memory blocking delays encountered by daemons running on other nodes. The switching of protocols is accompanied by controlled grace periods for needed responses. This messaging protocol flexibility is also adapted for use as a mechanism for controlling the deliberate activities of node addition (birth) and node deletion (death).

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of co-pending U.S. patent application Ser. No. 11/049,397, filed Feb. 2, 2005, entitled “A Method for Adding New Members to a Group by a Commit Message with Updated Membership List to all Nodes on the Updated List”, as amended, by Chang et al., which is a divisional of U.S. patent application Ser. No. 09/850,809 filed on May 8, 2001, the entirety of which are hereby incorporated herein by reference.

FIELD OF THE INVENTION

The present invention is directed to a method for determination of adapter and node death in a distributed data processing system that is capable of using messaging protocols which operate at different levels, with different priorities and/or with different characteristic response limitations. A significant advantage of the present invention is a superior resilience to false failure notifications caused by daemon blockage.

BACKGROUND OF THE INVENTION

The determination of adapter and node liveness lies at the heart of any highly available distributed data processing network in which the nodes are dividable into clusters which are typically employed to operate on dedicated applications. In order to provide high availability services, a cluster system should be able to determine which nodes, networks, and network adapters in the system are working. The failure in any such component should be detected early and the resultant information passed on to a higher level software subsystem and, if possible, recovery operations should be initiated by a cluster recovery manager and application level software.

Determination of node, network, and network adapter liveness is often made through the use of daemon processes running in each node of the distributed system. Daemons run distributed protocols and exchange liveness messages that are forced through the different network paths in the system. If no such liveness messages are received within a predetermined interval then the sending node or network adapter is assumed not to be working (“dead”) by the others.

This method of liveness determination imposes real-time constraints for the corresponding daemons: if a daemon gets delayed for any reason this may result in the hosting node being falsely detected as dead, a “false down” event. False down events result in unnecessary, and often costly, recovery procedures which can disrupt the operations of the cluster.

Making daemons obey these real-time constraints is often far from trivial, however, since the underlying operating system is seldom real-time. Only real-time operating systems can guarantee finite response times under any circumstances.

If the load on one of the nodes of the system is such that the physical memory needs greatly exceed the amount of memory present, heavy paging starts to occur, which occasionally leads to processes making little progress. In terms of the liveness determination daemon, these paging operations can operate to prevent it from sending liveness messages in a timely fashion.

Although some operating systems do provide primitives that allow processes to keep their pages from being “stolen” by other applications, in practice this solution is not perfect: either the primitives do not work on the entire addressing space (for example, they may not work with shared libraries) or the operating system itself is often pageable.

Besides memory starvation, other causes are known to prevent processes from making adequate progress: high interrupt rate, which blocks any process in the system from running, and the presence of high-priority processes that monopolize CPU utilization.

Different approaches could be used in order to prevent these “false down” events caused by process blockage:

1) Increasing the threshold of the number of missing incoming liveness messages before the remote entity is declared “down”;

2) Making the daemon as real time as possible, employing real-time scheduling priority and operating system primitives to prevent paging; and

3) Incorporating the code responsible for sending the liveness messages into the kernel.

The first method has the drawback that real failures take longer to be detected, which may result in longer periods during which the end-user service offered by the cluster is unavailable.

The second method is only partially effective. Not only does it require the use of multiple real-time primitives offered by the operating system, but also careful design to avoid known causes of blocking, such as communication and I/O. Still the operating system may be unable to guarantee that the process will always make progress.

The third method may produce good results, but at a sometimes prohibitive development cost, since code needs to be introduced into the operating system kernel, which severely impairs portability and serviceability. A subtle problem with this approach is that it can only provide “kernel liveness,” being ill-suited to detect situations where the kernel is able to run but not user programs. Under such situations, the node becomes useless and declaring it dead is likely to be the correct decision.

SUMMARY OF THE INVENTION

The present invention provides a mechanism which prevents data processing nodes from being prematurely and/or erroneously declared as being “dead” in a connected network of such nodes. More specifically, in accordance with a preferred embodiment of the present invention, there is provided a method for determining the status of nodes and/or node adapters in a network of connected data processing nodes. The nodes exist as a group, and there is preferably provided a group leader with special functions and authorities. (However, it is noted that the existence of a Group Leader is not essential for the operation of the present invention in its broadest scope.) Periodically, each node in the group sends a status message to a defined node in the group. This is referred to as a heart beat or heart beat message or more informally as an “I am alive” message. If the nodes are connected in a ring topology, which is the preferred implementation herein, then each node sends its heart beat message to its downstream neighbor in the ring. The transmitted message is directed to a daemon program running on the designated node (or nodes for non-ring topologies) for the purpose of providing responses to the “I am Alive” (heart beat) messages. The daemon does not respond by passing along the heart beat message to a predefined node or nodes, but rather, periodically sends its own heart beat message to a designated recipient (or to designated recipients in the event that a non-ring topology of node interconnection is employed). However, passing heart beat messages does provide an alternate, though unpreferred, method of providing this service. Each node periodically determines whether or not a heart beat signal has been sent from a designated sending node. If a certain number of heart beat signals are not received as expected, it is possible that the daemon responsible for their transmission has been delayed due to memory constraint problems at the sending node.
Delay in the response of this daemon can arise from a number of causes which can also include the situation in which other applications running on the node are assigned higher priorities of execution by the local operating system; this delay phenomenon is not solely limited to memory constraint problems. To assure that the present mechanism is “ahead of the game,” a second message is sent preferably before it is absolutely essential, that is, before other approaches, as described above, would have already declared the node “dead.” This second message is sent to the non-responding node. However, the second message is not sent to the daemon, but rather, to programming running on the message receiving nodes which does not possess the same memory constraint problems. In particular, the second message is preferably directed to an operating system kernel portion for which priority processing is available. Even more particularly, preferred embodiments of the present invention employ the “ping” function as employed in Unix-based operating systems.

In accordance with another embodiment of the present invention, the process of adding nodes to a group and the process of handling a node “death” are also endowed with this two-level messaging mechanism. This assures that the structure or topology of the group can be changed in an efficient manner without the imposition of delays in message communication, particularly those caused by local memory constraint problems. In particular, the present invention also employs a “Prepare To Commit” (PTC) message which is processed in a manner similar to the above-described “I am Alive.” (IAA) message. While the use of the concepts employed in the present invention can have the effect of cutting down on some delays that can occur, the primary advantage is that its use prevents nodes from prematurely being declared as “dead.”

Accordingly, it is an object of the present invention to provide a system and method for determining the state of node “liveness” in a distributed data processing network.

It is also an object of the present invention to avoid the problem of blocked daemons used in message processing requests which relate to node status, particularly the status that reflects whether or not a node is alive and functioning.

It is yet another object of the present invention to be able to eliminate delays in system processing and overhead caused by the situation in which nodes are prematurely declared as being “dead.”

It is a still further object of the present invention to process addition of nodes to a network group and to also process removal of nodes from a network group without unnecessary delay.

It is also an object of the present invention to eliminate message response delays engendered by memory constraint problems in remote nodes.

It is yet another object of the present invention to ensure full network system utilization and to avoid states in which full process participation by all of the desired nodes is lacking.

It is a still further object of the present invention to provide a mechanism for detecting deaths of nodes and/or node adapters and for providing a graceful reorganization of group membership.

Lastly, but not limited hereto, it is an object of the present invention to take advantage of operating system kernel functions which are not encumbered by memory paging, memory allocation, or similar delay-causing constraints.

The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of practice, together with the further objects and advantages thereof, may best be understood by reference to the following description taken in connection with the accompanying drawings in which:

FIG. 1 is a block diagram illustrating the environment in which the present invention is employed, namely, within an interconnected network of data processing nodes.

FIG. 2A is a signal flow graph illustrating the range of nodes employed in the transmission of a “proclaim” message;

FIG. 2B is a signal flow graph similar to FIG. 2A, but more particularly illustrating the nodes involved in a “join” response to a “proclaim” message;

FIG. 2C is a signal flow graph similar to FIGS. 2A and 2B, but more particularly illustrating the transmission of “prepare to commit” (PTC) messages for the protocol involved in joining new nodes to a group;

FIG. 2D is a signal flow graph similar to FIG. 2C, but more particularly illustrating the acknowledgment (PTC_ACK) of the “prepare to commit” message transmission.

FIG. 3A illustrates the nodes that are involved in a “commit” broadcast message.

FIG. 3B is a signal flow graph similar to FIG. 3A, but more particularly illustrating the range of nodes in the transmission of a “commit” message.

FIG. 4 is a node topology graph illustrating the ring connection which exists subsequent to the formation of a new group of nodes;

FIG. 5A is a signal flow graph illustrating the direction of flow of heart beat messages in a node group;

FIG. 5B is a signal flow graph similar to FIG. 5A, but more particularly illustrating “death” message transmission in the event of a node death;

FIG. 5C is a signal flow graph similar to FIG. 5A and FIG. 5B, but more particularly illustrating the range of nodes to which “prepare to commit” messages are sent;

FIG. 6 is a timeline diagram illustrating the concept that several heart beat messages are missed before lower level and/or higher priority echo request messages are sent;

FIG. 7 is an event timeline diagram which particularly illustrates message transmission in the situation in which the Topology Services daemon on Node 2 is temporarily blocked;

FIG. 8 is an event timeline illustrating message transmission in the situation in which Node 2 dies; and

FIG. 9 is an event timeline illustrating the transmission of “prepare to commit” and corresponding acknowledgment messages.

BEST MODE FOR CARRYING OUT THE INVENTION

The present mechanism is provided to prevent “false downs” that result from daemon blockage under stress loads and under other conditions that cause processes to be blocked. The mechanism uses Internet Control Message Protocol (ICMP) echo request messages, which are sent to the node/adapter which is suspected of being down. Since such messages are responded to by the kernel in interrupt mode, they are responded to even if the peer daemon is temporarily blocked. If an ICMP echo-reply response is received from the “suspected-dead” node or adapter, then a grace period is established for it, and the node or adapter is not declared dead, at least initially.

The present mechanism offers advantages over the three approaches itemized in the list given above. Unlike the first alternate approach suggested above, the detection time for real adapter and/or node failures is not increased. The present mechanism is also more effective than the second alternate approach given above, since the operating system kernel is more likely to be able to respond quickly to an ICMP echo request message than it is to allow a user-level process to run. Finally, the present mechanism does not require writing kernel code as would the third alternate approach from above.

The mechanism proposed herein is introduced in the context of its use in the Topology Services subsystem, which is part of IBM's Reliable Scalable Cluster Technology (RSCT) infrastructure. Topology Services provides the “liveness layer” of the system since it is responsible for determining the set of working nodes, networks, and network adapters in the cluster.

Heart Beat Protocols

To better explain the mechanism, and how it is employed in Topology Services, the adapter membership (“heartbeating”) protocols in the subsystem are now described in more detail.

In order to monitor the health and connectivity of adapters in each network, all adapters in the network attempt to form an “Adapter Membership Group” (AMG), which is a group which contains all network adapters that can communicate with each other in the network. Adapters in an AMG monitor the “liveness” of each other. When an AMG is formed, all group members receive an “AMG id” which uniquely identifies the AMG. Adapters that fail are expelled from the group, and new adapters that are powered up are invited to join the group. In both cases, a new AMG with a new AMG id is formed. Each AMG has one member that is the Group Leader (GL), and all members know which node is the Group Leader. Note that a node may belong to several AMGs, one for each of its network adapters.

To determine the set of adapters that are alive in each network, an adapter membership protocol is run in each of the networks. Messages in this protocol are sent using UDP/IP (User Datagram Protocol/Internet Protocol). While this protocol is referred to as an Internet Protocol, it should be noted that the use of this term herein does not imply the existence of any Internet connections nor does it imply dependence on the Internet in any way. It is simply the name of a conveniently used, well-characterized communication protocol usable within a connected network of data processing nodes.

Adapters that are alive form an AMG, where members are preferably organized in a virtual ring topology. To ensure that all group members are alive, each member periodically sends “heart beat” messages to its “downstream neighbor” and monitors “heart beat” messages from its “upstream neighbor.” “Heart beat” messages are also referred to herein as “I am Alive.” (IAA) messages. Death Protocols and Join Protocols, respectively, are run when adapters fail or when new adapters become functional. The goal of such protocols is to guarantee that the membership group contains at each moment all (and only) the adapters in the network (but only those belonging to the cluster) that can communicate with each other.

Besides the Group Leader, each group has a “Crown Prince” (backup group leader). See U.S. Pat. Nos. 5,805,786 and 5,926,619 for a description of the “Crown Prince” model. The group leader is responsible for coordinating the group protocols, and the crown prince is responsible for taking over the group leadership if the group leader adapter fails. Both the choice of group leader and crown prince, and the position of the adapters in the ring, are determined by a predefined adapter priority rule, which herein is preferably chosen to be the adapters' IP address. This address provides a conveniently available and unique identifier all of whose characteristics make it highly suitable for this role.
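The priority rule can be illustrated with a short Python sketch. It assumes IPv4 addresses given as strings and, as an assumption not stated in the text above, treats the adapter with the second-highest address as the crown prince; the function names are hypothetical and are not taken from Topology Services.

```python
import ipaddress


def rank_adapters(addresses):
    """Return adapter addresses ordered from highest to lowest IP address.

    Addresses are compared numerically, not as strings, so that
    "10.0.0.12" ranks above "10.0.0.9".
    """
    return sorted(addresses, key=ipaddress.ip_address, reverse=True)


def choose_leaders(addresses):
    """Pick (group_leader, crown_prince) under a highest-IP-first rule.

    Assumption: the crown prince is the next adapter in priority order.
    A singleton group has no crown prince, so None is returned for it.
    """
    ranked = rank_adapters(addresses)
    leader = ranked[0]
    crown_prince = ranked[1] if len(ranked) > 1 else None
    return leader, crown_prince
```

For a group containing 10.0.0.5, 10.0.0.12 and 10.0.0.9, this rule would name 10.0.0.12 as group leader and 10.0.0.9 as crown prince.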

A list of all possible adapters in each network is contained in a configuration file that is read by all of the nodes at startup and at reconfiguration time.

Join Protocol

In order to attract new members to the group, the Group Leader in each group periodically sends “PROCLAIM” messages to adapters that are in the adapter configuration but do not currently belong to the group. The message is only sent to adapters having a lower IP address than that of the sender. It is noted that, while the use of IP addresses is the preferred mechanism for properly directing “PROCLAIM” messages, any other convenient method is also applicable; the only requirement is transmission to a well-defined set of nodes having a Group Leader.

The “PROCLAIM” messages are ignored by all adapters that are not group leaders. A group leader node receiving a “PROCLAIM” message from a higher priority (higher IP address) node responds with a “JOIN” message on behalf of its group. The message contains the membership list of the “joining group.”
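The selection of “PROCLAIM” recipients and the “JOIN” response might be sketched as follows in Python. Messages are modeled as plain dictionaries, which is an illustrative assumption; the function names and the dictionary layout are hypothetical, and ignoring a “PROCLAIM” from a sender that does not have a higher address is a defensive assumption rather than a rule stated above.

```python
import ipaddress


def proclaim_targets(sender, configured, current_group):
    """List the adapters a Group Leader would send PROCLAIM to.

    Targets are adapters in the configuration that are not already in
    the sender's group and that have a lower IP address than the sender.
    """
    sender_ip = ipaddress.ip_address(sender)
    members = set(current_group)
    return [
        adapter
        for adapter in configured
        if adapter not in members and ipaddress.ip_address(adapter) < sender_ip
    ]


def respond_to_proclaim(receiver, receiver_is_leader, receiver_group, sender):
    """Return a JOIN message for the sender, or None if PROCLAIM is ignored."""
    if not receiver_is_leader:
        return None  # only group leaders answer PROCLAIM
    if ipaddress.ip_address(sender) <= ipaddress.ip_address(receiver):
        return None  # assumption: answer only higher-priority senders
    return {
        "type": "JOIN",
        "from": receiver,
        "to": sender,
        "members": list(receiver_group),  # membership list of the joining group
    }
```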

A node, say GL1, upon receiving a “JOIN” message from another node, say GL2, attempts to form a new group containing the previous members plus all members in the joining group. Node GL1 then sends a “Prepare To Commit” (PTC) message to all members of the new group, including node GL2.

Nodes receiving a “Prepare To Commit” message reply with a “PTC_ACK” (Prepare To Commit Acknowledgment) message. All of the nodes from which a “PTC_ACK” message was received are included in the new group. The group leader (node GL1) sends a “COMMIT” message, which contains the entire group membership list, to members of the newly formed group.

Receiving a “COMMIT” message marks the transition to the new group, which now contains the old members plus the joining members. After receiving this message, a group member starts sending “heart beat” messages to its (possibly new) downstream neighbor and starts monitoring “heart beat” messages from its (possibly new) upstream neighbor.

Both “Prepare To Commit” and “COMMIT” messages require an acknowledgment to ensure they were received. If no acknowledgment is received, then a finite number of retries is made. Failure to respond to a “Prepare To Commit” message, after all retries have been exhausted, results in the corresponding adapter not being included in the new group. If a daemon fails to receive a “COMMIT” message after all retries of the “PTC_ACK” message, then the local adapter gives up the formation of the new group and reinitializes itself into a singleton group. This phenomenon should only occur in the relatively rare case where the Group Leader fails in the short window between sending the “Prepare To Commit” and “COMMIT” messages.
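The leader's side of this exchange can be sketched in Python as a single round of “Prepare To Commit” with bounded retries. The transport is abstracted behind a hypothetical `send_ptc` callback that returns True when a “PTC_ACK” comes back; the retry count and the sequential loop are illustrative simplifications, since a real daemon would wait on timers and handle members concurrently.

```python
def form_new_group(leader, candidates, send_ptc, max_retries=3):
    """Run one PTC round and return the resulting COMMIT message.

    send_ptc(member) is a hypothetical transport hook: it sends one
    "Prepare To Commit" and returns True if a PTC_ACK was received.
    Members that never acknowledge, even after all retries, are left
    out of the membership list carried by the COMMIT message.
    """
    acked = [leader]  # the leader is always part of its own new group
    for member in candidates:
        if member == leader:
            continue
        for _attempt in range(1 + max_retries):  # first try plus retries
            if send_ptc(member):
                acked.append(member)
                break
    return {"type": "COMMIT", "members": acked}
```

A member that acknowledges on its third attempt is still included, while one that stays silent through every retry is dropped from the new group.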

When the Topology Services daemon is initialized, it forms a singleton adapter group (of which the node is the group leader) in each of its adapters. The node then starts sending and receiving “PROCLAIM” messages.

Death Protocol

A node or adapter monitors “heart beat” messages coming from its “upstream neighbor” (the adapter in the group that has the next highest IP address among the group members). When no “heart beat” messages are received for a predefined period of time, the “upstream neighbor” is assumed to have failed. A “DEATH” message is then sent to the Group Leader, requesting that a new group be formed.

Upon receiving a “DEATH” message, the group leader attempts to form a new group containing all adapters in the current group except the adapter that was detected as failed. The Group Leader sends a “Prepare To Commit” message to all members of the new group. The protocol then follows the same sequence as that described above for the Join protocol.

After sending a “DEATH” message, the daemon expects to shortly receive a “Prepare To Commit” message. A number of retries is attempted, but if no “Prepare To Commit” message is received, then the interpretation is that the Group Leader adapter (or its hosting node) is dead and that the “Crown Prince” adapter is also dead and, therefore, was unable to take over the group leadership. In this case, the adapter reinitializes itself into a singleton group and also sends a “DISSOLVE” message, inviting all group members to do the same. This is the mechanism that allows all members of the group to find out about the simultaneous demise of the Group Leader and the Crown Prince members.
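The behavior of the adapter that reported the death might be sketched as below. This is a simplified Python illustration: `ptc_arrived` is a hypothetical callback standing in for "a Prepare To Commit message was received after this attempt," the retry count is arbitrary, and the returned dictionary is only a convenient way of showing the resulting state and outgoing messages.

```python
def after_death_report(local, group, ptc_arrived, max_retries=3):
    """Decide what the reporting adapter does after sending DEATH.

    ptc_arrived(attempt) is a hypothetical hook returning True if a
    "Prepare To Commit" message was received after the given attempt.
    If none arrives after all retries, both the Group Leader and the
    Crown Prince are presumed dead: the adapter falls back to a
    singleton group and sends DISSOLVE to every other member.
    """
    for attempt in range(1 + max_retries):
        if ptc_arrived(attempt):
            return {"action": "await_commit", "group": list(group), "send": []}
    others = [member for member in group if member != local]
    return {
        "action": "singleton",
        "group": [local],
        "send": [("DISSOLVE", member) for member in others],
    }
```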

Basic Mechanisms

Once the AMG is formed, “heart beat” messages are preferably sent in a ring-like topology, with “downstream neighbors” monitoring periodic “heart beat” messages sent by the “upstream neighbor.” One or more of the claims herein also refer to the “heart beat” message as a “first message.” The downstream neighbor periodically checks to see whether a “heart beat” message was recently received from its upstream neighbor. If no message was received since the last check, then a “Missed Heartbeat” counter is incremented. If the Missed Heartbeat counter reaches a predetermined threshold value S (the “sensitivity”), then less sophisticated protocols would consider the remote adapter dead and report its demise.

It is also possible, while keeping within the scope and spirit of the present invention, to employ topologies that are not in the form of a ring. Any convenient topology may be employed. However, the ring topology is preferred since it is simple to implement and since it evinces greater scalability when the number of nodes is increased. Other topologies require that a description of the structural links is also communicated to the members of a group or prospective group member along with the communication of the member list, as described elsewhere herein. While this is generally an undesirable complication, it is, nonetheless, still possible without deviating from the broad principles upon which the present invention is based.

However, the protocol herein is changed so that when the counter reaches a value X (smaller than S), then an ICMP (Internet Control Message Protocol) echo request packet is sent to the adapter being monitored. If the remote node and adapter are alive, then the destination OS kernel, and most preferably its interrupt handler, replies with an ICMP “echo-reply” message, even if the peer daemon is blocked. The procedure is repeated when the counter reaches X+1, etc. If an ICMP “echo-reply” message is received from the monitored adapter, then this is interpreted as “the adapter being monitored is probably functioning, but that the corresponding daemon may be either blocked or dead.” Since there is no immediate way of knowing what is happening to the other side, a grace period is established. The missed heartbeat counter is allowed to go past S until a value S1, which is significantly larger than S, is reached. At that time, if no “heart beat” message is received from the monitored adapter, then the adapter is finally declared dead.

If a “heart beat” message is again received at some point in the count between X and S1, then the grace period is deactivated, and the counter is reset to zero. The goal of the grace period is to account for the remote daemon being blocked due to memory starvation or some other factor. If the remote adapter or node is indeed dead, then no ICMP “echo-reply” packet should be received, and, therefore, no grace period is established. Consequently, there should not be a concern that a valid “adapter dead event” is delayed by the grace period. The only occasion where such delay occurs is when the corresponding daemon dies or gets indefinitely blocked, which should be a comparatively less common problem than a daemon being temporarily blocked by an excessive load problem.

The value of S1 is chosen to account for the “maximum reasonable time”that the daemon may be blocked in a system under large load.
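The interplay of the thresholds X, S and S1 can be made concrete with a small state machine. The Python sketch below is illustrative only: the class and method names are hypothetical, the actual sending of ICMP echo requests is left to the caller (the `check` method merely reports that a ping is due), and echo replies that arrive before the counter has reached X are ignored as an assumption.

```python
class HeartbeatMonitor:
    """Missed-heartbeat counter with ping threshold X and limits S and S1."""

    def __init__(self, x, s, s1):
        if not (0 < x < s < s1):
            raise ValueError("expected 0 < X < S < S1")
        self.x, self.s, self.s1 = x, s, s1
        self.missed = 0      # the "Missed Heartbeat" counter
        self.grace = False   # True once an echo-reply has been seen
        self.heard = False   # heartbeat received since the last check

    def on_heartbeat(self):
        self.heard = True

    def on_echo_reply(self):
        # Assumption: a reply only matters while the peer is suspected.
        if self.missed >= self.x:
            self.grace = True

    def check(self):
        """Periodic check; returns 'ok', 'ping', 'wait' or 'dead'."""
        if self.heard:
            # A heartbeat resets the counter and cancels any grace period.
            self.heard = False
            self.missed = 0
            self.grace = False
            return "ok"
        self.missed += 1
        limit = self.s1 if self.grace else self.s
        if self.missed >= limit:
            return "dead"
        if self.grace:
            return "wait"   # kernel answered; daemon may only be blocked
        if self.missed >= self.x:
            return "ping"   # caller should send an ICMP echo request
        return "ok"
```

With X=2, S=4 and S1=8, a peer that never answers is reported dead at the fourth missed heartbeat, exactly as without the mechanism, whereas a peer whose kernel answers the echo request is only reported dead at the eighth.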

On different “flavors” of Unix systems, sending and receiving ICMP messages requires a program to open a “raw socket.” The raw socket's behavior is such that all of the ICMP packets received by the local adapter are given to each program that opens a raw socket. This may result in significant CPU resources being spent to process these packets, many of which could be unrelated to the “echo” message sent. To alleviate this problem, the “raw socket” is only kept open while it is being decided whether to apply the grace period. If no incoming “heart beats” are being missed or if a grace period is already in place, then the raw socket is closed.

Grace Period for Prepare To Commit (PTC) Message

Like the basic heartbeating mechanism, the re-forming of a group also has real-time constraints: if a node fails to respond to “Prepare To Commit” packets in a timely fashion, that is, before the Group Leader has given up waiting for “PTC_ACK” messages, then the corresponding adapter may be declared dead by the Group Leader. Therefore, a mechanism is desired for the case in which the daemon gets blocked while sending or responding to “Prepare To Commit” messages.

A similar “ping & grace period” mechanism is introduced to account for a node whose daemon gets blocked while responding to a “Prepare To Commit” message. If the Group Leader does not get any response from an adapter even after all retries, then the Group Leader sends an ICMP (Internet Control Message Protocol) echo request message to the adapter. If an “echo-reply” message is received, then the Group Leader infers that the remote daemon may be blocked and establishes a grace period for it. However, this mechanism alone presents a problem: all of the other adapters (which do not know about the grace period) could “time-out” while waiting for the “COMMIT” message and then give up on the new group. To counteract this problem, the other adapters also apply the “ping & grace period” mechanism on the Group Leader. As long as the Group Leader node responds to the ICMP echo request messages and the grace period has not expired, the other adapters remain waiting for the “COMMIT” message. Note that the grace period implemented by non-Group Leader nodes also handles the situation where the daemon at the Group Leader node becomes blocked itself. The amount of grace period applied by the non-Group Leader nodes takes into account the “Prepare To Commit” retries, where different nodes may get their “Prepare To Commit” messages in different retries.
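Both sides of this “ping & grace period” arrangement reduce to simple decisions, sketched below in Python. The function names, the string results, and the use of elapsed-time arguments are illustrative assumptions; the text above does not specify how the timeouts and grace periods are represented.

```python
def leader_ptc_decision(ack_received, retries_exhausted, echo_reply, grace_expired):
    """Group Leader's view of one adapter during a PTC round."""
    if ack_received:
        return "include"
    if not retries_exhausted:
        return "retry"
    if echo_reply and not grace_expired:
        return "grace"    # kernel is alive; the daemon may just be blocked
    return "exclude"


def commit_wait_decision(elapsed, base_timeout, grace, leader_answers_ping):
    """Non-leader's view while waiting for COMMIT from the Group Leader.

    elapsed, base_timeout and grace are durations in the same
    (arbitrary) unit. Waiting continues past the base timeout only
    while the leader still answers echo requests and the grace period
    has not run out.
    """
    if elapsed < base_timeout:
        return "wait"
    if leader_answers_ping and elapsed < base_timeout + grace:
        return "wait"
    return "give_up"
```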

The same method described above is also used for a daemon that sends a “DEATH” message. An ICMP echo request message is sent to the Group Leader in the case that a “Prepare To Commit” message takes too long to arrive (probably because the daemon at the Group Leader node is blocked).

What the mechanisms above achieve, without creating new protocol messages, is the ability to include an adapter in a new AMG even if the corresponding daemon gets blocked while running the protocol.

Attention is now directed to the figures which provide a complementary description of the structure, environment, and operation of the present invention. In particular, FIG. 1 is useful for appreciating the environment in which the invention is used. The environment comprises a plurality of data processing nodes (100, 101, 102, 103) interconnected (typically through multiple paths, as shown) to one another by means of network adapters (110 through 117). The exemplary nodes in which the invention is employed are the IBM p-Series of server products, formerly referred to as the RS/6000 SP (for Scalable Parallel). Each node typically includes at least one central processing unit, shared memory, local memory cache, and connections to an included non-volatile storage device, typically a hard disk DASD unit. An exemplary operating system for each node is the AIX Operating System, as supplied by the assignee of the present invention. Each node is capable of running with its own operating system which may or may not be AIX. AIX is a UNIX-like system and supports echo requests based on commands such as “ping” which are directed at the kernel (or core) of the operating system and which operate at a basically low level to provide fundamental “are you there” kinds of services. Clearly, communication problems can result when there is a node failure or even a failure of one of the network adapters.

The discussion above refers to the transmission of a periodically issued “proclaim” message. FIG. 2A illustrates the case where there are two node groups (nodes 200, 201 and 202 being one group; nodes 300, 301 and 302 being the other group). Nodes 200 and 300 are the Group Leaders of their respective groups. A “proclaim” message is sent from Group Leader #1 (node 200) in the rightmost group shown. This message need not be sent to nodes within its own current group. It is transmitted to all known nodes in the network, but such messages are only responded to by Group Leaders, hence the use of solid and dashed lines showing transmission of the “proclaim” message in FIG. 2A. Only Group Leaders respond to the “proclaim” message. The response, when a Group Leader wants to join an existing group, is the transmission of a “join” message to the sending Group Leader. See FIG. 2B. The protocol for joining a group also includes the transmission of a “prepare to commit” message to all of the nodes involved, as shown in FIG. 2C. The normal response to the PTC signal is the transmission of a message which acknowledges receipt of the PTC message, that is, the transmission of the PTC_ACK signal from the same nodes as shown in FIG. 2C. The transmission of this latter signal is shown more particularly in FIG. 2D.

FIG. 3A illustrates the use of a “commit broadcast” message transmission from Group Leader 200 to “Mayor Node” 202 for the processing of further transmissions regarding “commit” operations. The concept of a Mayor Node is that it is preferably used to off-load some of the communications burden from the Group Leader. This is more particularly shown in FIG. 3B which also illustrates the transmission of a “commit broadcast acknowledgment” message back to Group Leader 200 from Mayor Node 202. FIG. 3B also illustrates the transmission protocol for the “commit” message. In particular, the use of a “mayor” node is shown as being employed as a helpful mechanism to off-load work from the Group Leader, particularly communication burdens. Typically, there is one “mayor” on each subnet, and a mayor is responsible for relaying certain assigned messages from the group leader to its subnet. For some messages that need to go out to everyone, the group leader selects a Mayor Node from each subnet and sends the message point-to-point to such mayors. Each Mayor Node then sends the message by broadcast or point-to-point (depending on the message type and the size of the group) to each adapter on its subnet. The group-leader-to-mayor and mayor-to-subnet messages are subject to acknowledgment and retry. If a Mayor Node fails to acknowledge a message, then the group leader selects a new mayor on the failed mayor's subnet and repeats the process. Not every message that needs to reach everyone is sent using the Mayor Node as an intermediary. For example, the prepare-to-commit (PTC) message is sent point-to-point from a “group-leader-want-to-be” to each and every potential group member. No Mayor Nodes are employed for this variety of message transmission.
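By way of illustration only, the group-leader-to-mayor step described above (acknowledgment, retry, and selection of a new mayor on the same subnet) might be sketched as follows. The function name `relay_via_mayors`, the `send` callback, and the `max_retries` value are assumptions introduced for this sketch and do not appear in the specification.

```python
def relay_via_mayors(subnets, send, max_retries=2):
    """Pick one acknowledging mayor per subnet.

    subnets     -- dict mapping a subnet name to an ordered list of
                   candidate mayor nodes (hypothetical structure)
    send        -- hypothetical callable; send(node) returns True when the
                   node acknowledges the point-to-point message
    max_retries -- assumed retry count; the text gives no figure

    Returns a dict mapping each subnet to the mayor that acknowledged,
    or None if no candidate on that subnet ever acknowledged.
    """
    chosen = {}
    for subnet, candidates in subnets.items():
        chosen[subnet] = None
        for mayor in candidates:
            # One initial transmission plus max_retries retries;
            # any() stops at the first acknowledgment.
            if any(send(mayor) for _ in range(1 + max_retries)):
                chosen[subnet] = mayor
                break
            # No acknowledgment: fall through and try a new mayor
            # on the same subnet.
    return chosen
```

In this sketch a mayor that never acknowledges is simply skipped in favor of the next candidate on its subnet; the mayor-to-subnet leg, which the text says is likewise subject to acknowledgment and retry, is not modeled.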

At the end of the protocol established for adding nodes, the Group Leader 200 organizes all of the nodes in the new group into a new topology. As seen in FIG. 4, this topology is preferably a ring-topology for the reasons indicated above (simplicity and scalability, in particular). In preferred embodiments of the present invention, the new Group Leader is selected to be the node with the highest IP address. However, any other convenient mechanism may also be employed for this purpose, as long as the process results in a unique selection. For example, the node that has been operative for the longest time may be selected. It is also noted that node death will also result in the formation of a new group topology. The same mechanisms that are employable with the construction of a new topology for node and/or group addition are also employable in the event of node death. However, in the case of node death, the ring-topology situation is particularly easy to structure with a new topology which, in effect, simply bypasses the defunct node.
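By way of illustration only, the preferred selection of the Group Leader by highest IP address, and the bypassing of a defunct node in a ring, might be sketched as follows. The function names are hypothetical, IPv4 addresses are assumed, and ordering the ring by descending address is an assumption of this sketch; the specification requires only that the leader selection be unique.

```python
import ipaddress

def select_group_leader(addresses):
    """Return the member whose IP address is numerically highest."""
    # Compare parsed addresses, not strings: "10.0.0.10" > "10.0.0.9".
    return max(addresses, key=ipaddress.ip_address)

def build_ring(addresses):
    """Map each member to its downstream neighbor.

    Assumption: members are arranged in descending address order and the
    last member wraps around to the first.
    """
    ordered = sorted(addresses, key=ipaddress.ip_address, reverse=True)
    return {node: ordered[(i + 1) % len(ordered)]
            for i, node in enumerate(ordered)}

def bypass(ring, dead):
    """Return a new ring in which the dead node's upstream neighbor
    points directly at the dead node's downstream neighbor."""
    successor = ring[dead]
    return {node: (successor if nxt == dead else nxt)
            for node, nxt in ring.items() if node != dead}
```

For example, with members 10.0.0.9, 10.0.0.10 and 10.0.0.2, the leader under this sketch is 10.0.0.10, and removing 10.0.0.9 leaves a two-node ring between 10.0.0.10 and 10.0.0.2.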

One of the underlying operations that the present invention employs is the heart beat mechanism described above. This is also illustrated more pictorially in FIGS. 5A, 5B, and 5C. FIG. 5A illustrates the passage of the heart beat message around a ring-topology group. FIG. 5B illustrates the occurrence of node or adapter failure at node 201. Since the heart beat message is periodic and can be expected at certain times, its absence at node 300 is noticed. When this occurs, a node “death message” is sent from node 300 to Group Leader 200. Group Leader 200 then responds by attempting to form a new group without deceased node 201. To do so, a “prepare to commit” (PTC) message is sent to all of the remaining nodes in the former group, including, in effect, a PTC message to node 200, the Group Leader. New group formation then proceeds in the manner described above.

FIG. 6 illustrates pictorially the initial receipt of a heart beat message at Node #1 from Node #2. If a predetermined number of such messages are not received, an ICMP echo request message is sent from Node #1 to Node #2. Such echo request messages are directed to ports and/or software for which priority handling is available. This priority can be provided since such messages are typically designed to be very simple and to not require any significant amount of processor time or resources on the receiving end (or even on the transmitting end). Such messages are simple and are replied to quickly whenever it is possible. Thus, even if certain expected messages are held up due to events such as memory constraint problems, the ICMP echo request and the ICMP echo reply are handled quickly, thus preventing the premature and inaccurate reporting of a node death, when in fact, the node is only suffering a “short illness.”

FIG. 7 illustrates the exchange of heart beats and messages in the event that there is not just “short term node illness,” but a genuine node or adapter failure. In particular, it is seen that, in general, the method tolerates a certain number of missed heart beat messages. When a certain number of heart beat messages have been missed (preferably 2), several attempts are made to transmit an ICMP echo request message. If, after a certain number of such attempts (preferably 3), no echo reply has been received, Node #2 is declared dead.
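By way of illustration only, the decision described for FIG. 7 might be sketched as follows, using the preferred values of 2 missed heart beats and 3 echo attempts. The function name, the `send_echo_request` callback, and the three state labels are assumptions of this sketch; no actual ICMP traffic is generated.

```python
MISSED_BEFORE_PROBE = 2  # "preferably 2" missed heart beats
MAX_ECHO_ATTEMPTS = 3    # "preferably 3" echo request attempts

def classify_neighbor(missed_heartbeats, send_echo_request):
    """Return "alive", "suspect" or "dead" for a monitored neighbor.

    send_echo_request is a hypothetical callable that sends one ICMP echo
    request and returns True if an echo reply arrived in time.
    """
    if missed_heartbeats < MISSED_BEFORE_PROBE:
        return "alive"  # still within normal tolerance, no probing yet
    for _ in range(MAX_ECHO_ATTEMPTS):
        if send_echo_request():
            # Adapter and kernel answer, so only the daemon is late.
            return "suspect"
    return "dead"  # no echo reply after all attempts
```

Passing the probe in as a callable keeps the decision logic separate from the transport, so the same function can be exercised with a stub in place of a real echo request.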

FIG. 8 is similar to FIG. 7, but it illustrates the advantages of the present invention that occur when a node is not really “dead,” but only suffering from a short-term problem. In such cases, the questionable node is sent an ICMP echo request message, and a response to this message is sent to the transmitting node (Node #1 here). This permits the establishment of a grace period for the re-establishment of a heart beat message. In the example shown, the threshold is extended from 5 to 8 such heart beats. Here, the short term illness is caused by the blocking of the daemon on Node #2, probably caused by memory constraint problems. However, there could be other causes as well for which the present method provides performance and stability benefits.
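By way of illustration only, the grace period of FIG. 8 might be sketched as a counter whose death threshold is raised from 5 to 8 missed heart beats once an echo reply has been seen. The class and method names are hypothetical, and resetting the grace state when a heart beat returns is an assumption of this sketch.

```python
class HeartbeatMonitor:
    """Tracks missed heart beats from one neighbor (illustrative only)."""

    def __init__(self, base_threshold=5, extended_threshold=8):
        self.base_threshold = base_threshold          # normal limit
        self.extended_threshold = extended_threshold  # limit during grace
        self.missed = 0
        self.echo_replied = False

    def on_heartbeat(self):
        # Assumption: a received heart beat clears both the miss count
        # and any grace period in effect.
        self.missed = 0
        self.echo_replied = False

    def on_echo_reply(self):
        # The neighbor's kernel answered, so the grace period applies.
        self.echo_replied = True

    def on_missed_heartbeat(self):
        """Record one missed heart beat; return True if the neighbor
        should now be declared dead."""
        self.missed += 1
        limit = (self.extended_threshold if self.echo_replied
                 else self.base_threshold)
        return self.missed >= limit
```

With the default values, a neighbor that never answers an echo request is declared dead on the fifth missed heart beat, while one that has answered is given until the eighth.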

FIG. 9 illustrates the extension of the above method from heart beat messages to the transmission of “prepare to commit” (PTC) messages. In this example, it is assumed that there are only three nodes participating (for graphical simplicity): Node #1, Node #2 and Node #3. The sequence begins with the transmission of a “prepare to commit” message from Node #1 (to all of the other nodes) which also starts a “PTC retry timer” as a mechanism for gauging when responses to the PTC message are expected. The lack of an acknowledgment message from Node #3 (due to a blocked daemon as opposed to a dead node or adapter) triggers the transmission to Node #3 of a second PTC message. The lack of response to the PTC retry #1 message triggers the transmission of the lower level ICMP echo-request to Node #3. Since the lack of response to the PTC messages is due only to temporary daemon blockage, Node #3 is still able to respond to the echo request message with an ICMP echo-reply message sent to Node #1. The receipt of the ICMP echo-reply message at Node #1 means that Node #3 is only temporarily blocked and not “dead.” This permits the time for the acknowledgment of the PTC message to be extended in spite of the event of daemon blockage at Node #3. In this example, the PTC message is sent to Node #2 and Node #3. Node #2 responds with a PTC_ACK (acknowledgment) message. The lack of response from Node #3 to the original PTC message triggers the transmission of an ICMP echo-request message to Node #3. As described above, this message is one that is simpler, more direct, and is one that is much more likely to be responded to in the event that the node and its adapter unit(s) are alive but otherwise busy. While this message exchange is taking place primarily between Node #1 and Node #3, in the meantime Node #2 has acknowledged the original PTC message and is waiting for a “commit” message from Node #1. If Node #2 has not received the “commit” message and the time for its expected arrival has elapsed, Node #2 also assumes there is a problem and preferably also sends an ICMP echo-request message to Node #1. In the illustration shown in FIG. 9, it is seen that a response to the ICMP echo-request message sent from Node #2 (namely, an ICMP echo-reply message) is received at Node #1, thus causing the normal deadline for receiving a commit message to be extended in substantially the same manner as described above, with similar default preferences for recognition and extension. This assures that all Nodes ultimately receive and respond to a commit message.
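By way of illustration only, Node #1's handling of a single peer in the FIG. 9 exchange might be simulated as follows: a PTC message, one PTC retry, an ICMP echo-request, and then an extended wait if an echo-reply arrives. The function name, the tick-based timer model, the single echo attempt, and the `grace_ticks` value are all assumptions of this sketch rather than details taken from the specification.

```python
def drive_ptc(ack_tick, echo_ok, ptc_retries=1, grace_ticks=3):
    """Simulate the PTC exchange with one peer.

    ack_tick    -- retry-timer tick at which the peer's PTC_ACK arrives,
                   or None if it never arrives (hypothetical model)
    echo_ok     -- True if the peer's kernel answers the echo-request
    ptc_retries -- number of PTC retries before falling back to echo
    grace_ticks -- assumed length of the extended wait, in timer ticks

    Returns (outcome, log) where outcome is "acked" or "dead" and log
    lists the messages exchanged, in order.
    """
    def acked(tick):
        return ack_tick is not None and ack_tick <= tick

    log = ["PTC"]
    tick = 0
    for _ in range(ptc_retries):
        tick += 1                      # PTC retry timer expires
        if acked(tick):
            return "acked", log
        log.append("PTC retry")
    tick += 1
    if acked(tick):
        return "acked", log
    log.append("ICMP echo-request")    # fall back to the lower level
    if not echo_ok:
        return "dead", log             # neither daemon nor kernel answers
    log.append("ICMP echo-reply")      # peer is blocked, not dead
    for _ in range(grace_ticks):       # extended acknowledgment deadline
        tick += 1
        if acked(tick):
            return "acked", log
    return "dead", log
```

In this model a peer whose daemon is blocked until tick 4 is still admitted, because the echo-reply extends the deadline; without the reply the same peer would have been dropped after tick 2.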

From the above, it should be appreciated that all of the stated objects are achieved in one or more embodiments of the present invention. It should also be appreciated that there is provided a mechanism for detecting remote adapter and/or node failure which is based on relatively low priority messaging for most of its functioning, but which is capable of reverting to a lower level, higher priority messaging system using echo request messages in conjunction with a grace period as an alternate mechanism. More particularly, it is seen that the present invention provides a mechanism for dealing with blocked daemons in the execution of a distributed protocol. It is furthermore seen that the present invention also does not require introducing new messages into the protocol.

While the invention has been described in detail herein in accordance with certain preferred embodiments thereof, many modifications and changes therein may be effected by those skilled in the art. Accordingly, it is intended by the appended claims to cover all such modifications and changes as fall within the true spirit and scope of the invention.

1. An apparatus for data processing comprising: a connected network of data processing nodes having operating systems for controlling said nodes together with programs at each node for controlling the interconnection of said nodes into groups of nodes; first program means within one of said nodes for periodically sending from said one node, to at least one other node in said network, a first message which is also expected to be periodically received at said at least one other node, said first message being sent by a daemon program running on said one node; and second program means within said at least one other node for sending a second message to said one node after having failed to receive said first message within a certain number of said periods, said second message being directed to an other program running on said one node, said other program being less susceptible than said first program means to being delayed.